Rigdelet neural network and improved partial reinforcement effect optimizer for music genre classification from sound spectrum images

In this paper, a new approach has been introduced for classifying the music genres. The proposed approach involves transforming an audio signal into a unified representation known as a sound spectrum, from which texture features have been extracted using an enhanced Rigdelet Neural Network (RNN). Additionally, the RNN has been optimized using an improved version of the partial reinforcement effect optimizer (IPREO) that effectively avoids local optima and enhances the RNN's generalization capability. The GTZAN dataset has been utilized in experiments to assess the effectiveness of the proposed RNN/IPREO model for music genre classification. The results show an impressive accuracy of 92 % by incorporating a combination of spectral centroid, Mel-spectrogram, and Mel-frequency cepstral coefficients (MFCCs) as features. This performance significantly outperformed K-Means (58 %) and Support Vector Machines (up to 68 %). Furthermore, the RNN/IPREO model outshined various deep learning architectures such as Neural Networks (65 %), RNNs (84 %), CNNs (88 %), DNNs (86 %), VGG-16 (91 %), and ResNet-50 (90 %). It is worth noting that the RNN/IPREO model was able to achieve comparable results to well-known deep models like VGG-16, ResNet-50, and RNN-LSTM, sometimes even surpassing their scores. This highlights the strength of its hybrid CNN–Bi-directional RNN design in conjunction with the IPREO parameter optimization algorithm for extracting intricate and sequential auditory data.


Introduction
A music genre serves as a customary classification that designates musical compositions as part of a collective tradition or a specific set of customs [1].It provides a means of organizing music into distinct groups that possess comparable attributes, although the standards for defining genres may differ [2].Music genres are frequently classified based on a range of stylistic criteria, including musical techniques, context, content, and the overall essence of the music [3].
Classification of music genres plays a vital role concerning MIR (Music Information Retrieval) and is essential for improving music recommendation systems [4].It involves organizing musical compositions into distinct genres, which serve as descriptors for the stylistic and structural attributes of the music [5].These genres encompass a wide range of styles, including rock, jazz, classical, hip-hop, and electronic, among others [6].
The significance of music genres goes beyond mere categorization.They assist users in navigating vast music libraries, enabling efficient sorting and retrieval of tracks that align with their personal preferences, moods, and occasions [7].This capability is particularly important for user experience, especially in streaming platforms and digital music services, where listeners enjoy personalized playlists and discovery features tailored to their specific musical tastes [8].
However, the process of music genre classification is complex.Unlike other classification tasks, the categorization of music into genres is highly subjective.What one person may classify as rock, others may interpret as alternative or indie, leading to inconsistencies and variations in classification.This subjectivity is further complicated by the inherent ambiguity of music genres [9].Boundaries between genres are not always clearly defined, and many musical compositions exhibit characteristics of multiple genres, resulting in overlaps that complicate straightforward classification.
The challenge of classifying music is further complicated by its intricate nature.To accurately categorize a genre, one must identify and analyze various auditory elements, including tempo, rhythm, melody, harmony, and timbre.Tempo determines the speed and energy of a piece, while rhythm establishes its flow.Melody and harmony contribute to the distinct tune and chord progressions that make music catchy.Timbre, on the other hand, distinguishes different sounds, even if they share the same pitch and loudness, and plays a crucial role in recognizing unique voices or instruments.Lyrics also provide thematic and linguistic cues that are sometimes essential for defining a genre.
However, accurately measuring or quantifying these elements can be challenging.
Classifying music genres provides an important impact on the Music Information Retrieval (MIR) that aids in organizing and suggesting music based on the listener's preferences.However, accurately classifying genres remains a difficult problem due to the diverse and complex nature of music genres, the presence of noise and variations in audio signals, and the reliance on manually selected features that may not fully capture the essential characteristics of genres.
Traditional signal processing techniques often fail to capture the nuances of these musical qualities, and interpreting lyrics requires advanced natural language processing capabilities.As a result, recent advancements in music genre classification heavily rely on machine learning and artificial intelligence.These technologies excel at handling vast datasets and learning complex patterns that may elude human curators.
Current approaches utilize sophisticated algorithms that not only process metadata but also analyze raw audio features using signal processing methods like the Fourier Transform or the creation of Mel spectrograms [10].Deep learning approaches, like CNN (Convolutional Neural Networks), show promise in analyzing spectrograms and audio patterns to classify genres with higher accuracy.Despite these advancements, music genre classification remains an ongoing field of research due to the development of novel sub-genres and genres and the ever-evolving nature of music.
Foleis et al. [11] conducted a study for evaluating the influence of texture choice on automated categorization of music genre.A new selector of texture was presented on the basis of K-Means that its purpose was recognizing various textures of sound in all tracks.The outcomes revealed that taking variation in each track was really essential for enhancing efficacy of categorization.The findings represented that the selector of texture on the basis of K-Means could accomplish considerable enhancements compared to the baseline that had less textures in each track in comparison with several selectors of texture that were assessed.It was also indicated that utilizing several representations of texture let more opportunities for selection of attribute for enhancing efficiency of categorization.
In order to address the issues of previous studies regarding the classification of music, a research was carried out by Ashraf et al. [12].Within the current study, GLR (Global Layer Regularization) method was suggested regarding the hybrid system of RNN and CNN by the use Mel-spectrograms to assess accuracy of training.The study could enhance efficacy of FMA (Free Music Achieve) and GTZAN datasets.The accuracy value of the FMA and GTZAN were, in turn, 68.87 % and 87.79 %. the suggested system could benefit from spatiotemporal attributes of region and regularization approach of global layer for gaining the accuracy that was reliable in comparison with other studies.
In similar vein, Liu et al. [13] entirely exploited information that was low-level through audio's spectrograms and developed a new CNN methodology.The suggested model considered data of multi-scale time-frequency that transferred more appropriate attributes semanticity for the layer that makes decision to distinguish various genres of music that were not known.The outcomes were assessed in accordance with datasets of objective, comprising Ballroom, Extended Ballroom, and GTZAN that their values of accuracy were, in turn, 96.7 %, 97.2 %, and 93.9 %.Sharma et al. [14] utilized two various method for implementing models of categorization.Mel Frequency Cepstral Coefficients (MFCC) were utilized as attributes and carried out systems, such as CNN (1, 2, and 3 Layers), DNN (1, 2, and 3 Layers), and LSTM-RNN, SVM (Polynomial, Gaussian Kernel, and Sigmoid) as method one.An input that had three channels were made via integrating attributes, such MFCCs, Scalogram, and Spectrogram.Moreover, it carried out some systems, such as Convolutional Neural Network (1, 2, and 3 Layers), ResNet-50, and VGG-16 as the second method.It was illustrated in the research that RNN-LSTM and three-layered Convolutional Neural Network could outperform each method.In the end, the findings of the study revealed that attributes of MFCCs were more efficient compared to input of three channels.Hence, MFCCs could perform better that its accuracy value was 96.08 %.Additionally, the accuracy of the both 1-layer CNN and RNN-LSTM were identical 96.08 %; however, their loss of validation were, in turn, 0.1111 and 0.1356.
Singh and Bohat [15] implemented a research to make a system to suggest pieces of music that assisted people to offer music in an automated way on the basis of likeness of music that are accessible.Different attributes of music were extracted by the suggested model.Moreover, that model was utilized for forecasting the music's grade as suggestion of music.The suggested model was evaluated by distinct assessment variables, such as recall, F1 measure, and precision.The outcomes illustrated that the recommended system could outperform the baseline method.
Existing approaches to music genre classification often depend on the manual selection of features.This involves choosing relevant and representative characteristics from the audio signal, such as spectral, temporal, or cepstral features.However, manual feature selection has various limitations.Firstly, it is a time-consuming process that demands domain knowledge and expertise.Secondly, it may fail to capture the fundamental characteristics of music genres, as different genres can exhibit distinct feature distributions and correlations.Lastly, it can introduce noise and redundancy by including irrelevant or redundant features for classifying various genres of music.
For overcoming the limitations of manual feature selection, we present a new method for music genre classification.This method involves converting an audio signal into a sound spectrum, which serves as a comprehensive representation.From these timefrequency images, texture features are extracted using an enhanced Rigdelet Neural Network (RNN).The sound spectrum is a twodimensional image that captures the energy distribution of the audio signal across time and frequency.
This research offers some contributions to the current state-of-the-art in music genre classification.Firstly, it eliminates the need for manual feature selection, saving time and potentially yielding more accurate results.Secondly, it utilizes the sound spectrum, preserving both temporal and frequency information in the audio signal.Thirdly, it employs an enhanced Ridgelet Neural Network (RNN) that effectively learns complex and high-level features from the sound spectrum, combined with an improved IPREO algorithm that optimizes the RNN efficiently.Fourthly, our approach demonstrates robustness to noise and variations in the audio signal, enabling it to handle different music genres with distinct characteristics.Finally, we compare our proposed method with several state-of-the-art methods and demonstrate its superiority in terms of accuracy and efficiency.

Preprocessing
In the following, there are preprocessing stages that utilized in our research.

Signal pre-emphasis
This initial processing stage involves the application of a high-pass filter to the audio signal in order to boost the high-frequency components relative to the low-frequency components.This is crucial for compensating for the natural decrease in high-frequency components in recorded audio signals.The pre-emphasis filter can be created using a finite impulse response (FIR) filter with a transfer function of the form: where, α is a scalar constant that regulates the cutoff frequency of the filter.Typically, a value of α = 0.97 is utilized for music signals.

Frame blocking
This preprocessing step entails dividing the audio signal into overlapping frames for analysis.Framing is essential because audio signals are non-stationary, implying that their statistical characteristics change over time.By segmenting the signal into smaller frames, we can assume that the statistical properties of each frame remain relatively constant.The frame size is usually selected to be around 20-30 ms, with a 50 % overlap.

Hamming windowing
This preprocessing stage involves the application of a window function to each frame to reduce the impact of spectral leakage caused by the rectangular window used in the previous step.The Hamming window is a tapered window function that smoothly transitions from zero to unity and back to zero over the duration of the frame.The Hamming window can be mathematically represented as: where, n is the sample index within the frame, and N is the frame size.

Feature detection of music signal
The spectral centroid is an attribute that quantifies the "center of gravity" or the "brightness" of a sound spectrum.It is frequently employed in the analysis of music and the classification of genres, as it can capture the timbre or tonal quality of a sound.For instance, a sound with a higher spectral centroid will have a brighter perception compared to a sound with a lower spectral centroid.
To calculate the spectral centroid, one must compute the weighted average of the frequencies existing within the spectrum.The weights assigned to each frequency bin are determined by their magnitudes.Mathematically, the spectral centroid at frame t has been illustrated in the following way: The magnitude of the frequency bin k at frame t is denoted as S[k,t], while the center frequency of the bin k is represented as freq [k].The denominator in this equation refers to the sum of all the magnitudes in the frame, which serves to normalize the spectrum and create a distribution across frequency bins.
The spectral centroid can be calculated either from a time-domain signal or a frequency-domain representation, such as a spectrogram.In the time domain, the spectral centroid can be estimated by utilizing the zero-crossing proportion, which measures the quantity of times the signal alters mark in a given frame.The zero-crossing rate is directly related to the spectral centroid, as a higher rate indicates a higher frequency content.However, the zero-crossing rate lacks accuracy as it does not consider the signal's amplitude or shape.
On the other hand, in the frequency domain, the spectral centroid can be computed from a spectrogram, which provides a twodimensional depiction of the signal's energy distribution across time and frequency.For obtaining the spectrogram, an STFT (Short-Time Fourier Transform) has been employed within the signal, dividing it into overlapping frames and calculating the DFT (Discrete Fourier Transform) for all frames.The Discrete Fourier Transform generates a complex-valued vector of length n, where n represents the window size of the STFT.
The magnitude of each vector element represents the energy of a frequency bin, while the angle corresponds to the phase.By utilizing the aforementioned formula, the spectral centroid can be determined.In this formula, S[k, t] denotes the magnitude of the k-th element of the DFT vector at frame t, and freq[k] represents the center frequency of bin k and is achieved as follows: where, sr describes the signal sampling ratio.
The spectral centroid can be calculated using a mel spectrogram as well.A mel spectrogram utilizes a nonlinear frequency scale that imitates the way humans perceive sound.The mel scale is established on the concept that the perceived pitch of a sound is not directly proportional to its frequency, but rather follows a logarithmic pattern.The mel scale can be defined as follows: The mel spectrogram has been found to be an indication of a signal's frequency content in mel units.Here, f illustrates the frequency in Hertz and mel(f) represents the frequency in mel units.To obtain the mel spectrogram, a mel filter bank is applied to the spectrogram.
This filter bank consists of a sequence of triangular filters that cover the occurrence range of the signal.These filters are evenly spaced on the mel scale and their heights are normalized to have a unit area.The mel spectrogram is then obtained by multiplying the spectrogram with the mel filter bank.The spectral centroid, which is a measure of the center of mass of the frequency distribution, can be computed from the mel spectrogram using the provided formula.In this formula, S[k, t] represents the magnitude of the k th mel bin at frame t, and freq[k] represents the center frequency of the bin k that might be computed via the subsequent equation: where, mel[k] specifies the center frequency of the bin k in mel units, which can be obtained from the mel filter bank.The spectral centroid proves to be a valuable attribute in the classification of music genres, owing to the distinct spectral centroids observed across different genres.Classical music, for instance, typically exhibits a lower spectral centroid compared to rock music, primarily due to its abundance of low-frequency components and relatively fewer high-frequency components.
Moreover, even within a particular genre, the spectral centroid can exhibit variations influenced by factors such as the choice of instruments, tempo, dynamics, and overall mood of the music.Leveraging the spectral centroid as an input, machine learning machines like Support Vector Machines, Neural Networks, or decision trees can effectively classify music genres based on their unique  sound spectra.
On the other side, MFCCs, known as Mel-Scale Frequency Cepstral Coefficients, serve as a feature extraction technique that effectively captures the spectral attributes of a sound signal.These coefficients find extensive application in the realms of music analysis and categorization of genres.The underlying principle behind MFCCs lies in the understanding that human perception of sound does not adhere to a linear relationship with frequency, but rather follows a logarithmic pattern.Consequently, MFCCs adopt a non-linear frequency scale, referred to as the mel scale, which closely emulates the auditory system of humans.For example, consider a music (champion.mp3).Fig. (1) shows the signal waveform.MFCCs are another feature that are used in this research.MFCCs created by dividing a signal into overlapping frames, applying a window function, computing the discrete Fourier transform, applying a mel filter bank to the magnitude spectrum, compressing the dynamic range, applying the discrete cosine transform to the bank outputs of log filter, and retaining merely a subset of the cepstral coefficients as final features, representing coarse spectral shape and fine spectral details. where, ) n where, where

Rigdelet neural network
Deep learning models utilized in music genre classification often integrate different methodologies, such as employing CNNs [16] Fig. 2. Feature extraction by the spectral centroids for this signal waveform.
F. Wang et al.
to extract spatial features and RNNs for temporal analysis, in order to enhance accuracy levels [17].By using these techniques, automated music categorization becomes feasible, enabling applications like music recommendation systems, digital libraries, and other platforms that benefit from identifying the genre of a particular track [18].
The current model was suggested by Cands E.J. and Donoho D.L.In order to design the current strategy, the theory of the Littlewood-Paley has been employed [19].Moreover, it has been employed via analysis of wavelet and of novel analysis that is harmonic.The present approach has been explained via the subsequent equation: The criteria should be fulfilled via ρ : here, D-dimensional permitted neural function has been illustrated by ρ.The criteria of ρ γ has been fulfilled by the subsequent formula: where, γ has been involved within neuron parameter area λ: where, the parameter Υ equals (c, u, e) and possesses normal explanation.The orientation, the location of Rigdelet have been, in turn, determined by c, u, and e.Eventually, the Ridge function that has been generated by a permitted neural function of activation has been named Ridgelet.
The approximation procedure of multivariable functions involves the use of ridge function iterations to create various suitable functions that exhibit heterogeneity within space.The current approximation kind has been considered to faster compared to wavelet and Fourier transform regarding speed.When, y equals f (z) : Q D → Q H and has been divided into H projection ratio Q D → Q, Rigdelet functions are used for elementary variable.It has been represented by subsequent equation: The current model has been developed according to the formula above.The current algorithm's model has been demonstrated in Fig. (4).
Practical output, perfect output, and input have been, in turn, represented by y j , ŷj , and z j , g j,i has been considered to be weight that links ridge function i to output node j, and g j0 determines threshold quantity of j th benchmark node.Regarding each j = 1, …, N, the output of j th concealed unit has been expressed as w j = [w j1 ,…,w jP ].The formula has been calculated in the following way regarding all instances of p.
The cost function has been ascertained via the subsequent equation: In this study, a modified metaheuristic algorithm, called Improved Partial Reinforcement has been utilized for optimizing the structure of the RNN by minimizing the cost function in Eq. ( 18).

Partial reinforcement effect (PRE) optimizer
The diverse reinforcement calendars that were able to influence an organism's manner differently are represented in some studies.Their study indicated that the reinforcement approaches and timing significantly impacted the manners' power and stability.This theory's foremost perceptions are presented in the following section.

Reinforcement schedules
The reinforcement process is complex, and a lot of parameters could impact the way that rapidly and efficiently a novel proficiency is preserved.the reinforcement's timing and frequency are impacted by in what way rapidly and effortlessly manners are educated by apprentices.Precisely, how old and novel manners are trained is contingent on the frequency and time of the reinforcement procedure.Skinner advanced numerous different reinforcement plans, which could impact the procedure of agent conditioning.These reinforcement schedules are designated in the following parts.
-Continuous reinforcement: the current reinforcement kind consists of distributing reinforcement once an answer happens.Education is generally comparatively quick; however, the response ratio is properly low.Furthermore, destruction happens rapidly once improvements are discontinued.-Fixed-ratio schedules: This is a kind of restricted reinforcement.Here, the members' manners have been reinforced occasionally once a definite number of answers have been received.It generally gets produced within a fairly stable answer ratio.-Fixed-interval schedules: the current reinforcement kind has been merely utilized once a definite quantity of time has passed.The answer ratio rests rather stable and starts to surge as the reinforcement time methods, however, it upsurges gradually as the reinforcement has been supplied.-Variable-ratio schedules: Within the current kind of restricted reinforcement, after diverse numbers of responses, a manner is strengthened.It leads to an upper rate of answer and a slower rate of extinction.-Variable-interval schedules: Skinner defines the variable interval schedules as the last type of restricted reinforcement.Here, reinforcement has been provided once a parameter epoch has been prepared.It has a tendency to consequence in a quicker response ratio and a slower destruction ratio With eradicating what is disagreeable or disapproving, the response is reinforced in these circumstances.Such as, if your kid commences yelling in a shop and then finishes whenever they are given a treat, it is more possible to give the child a treat whenever they yell for a second time.Subsequently, your manner eradicates unfriendly situations (the crying of a child) and negatively reinforces your manner.Whenever the dog as a learner is encouraged with a bell to bring the ball as an answer, the dog would be positively strengthened and bring the ball.Otherwise, the animal would obtain negative reinforcement.

Offered Partial Reinforcement Optimizer (PRO)
In the current part, the Partial Reinforcement Optimizer would be designated based on more factors.Firstly, the algorithm's preliminaries will be clarified, counting in what way it is stimulated by and demonstrated on the basis of the Partial Reinforcement Optimizer.Then, the Partial Reinforcement Optimizer will be designated in detail.
For an illustration of the PRE theory's rules and perceptions of the Partial Reinforcement Optimizer's components, it is necessary for modeling the Partial Reinforcement Optimizer, which is known as the optimization algorithm.The next norms have been regarded: Learner: A person or animal has been known as the learner whose manners must be qualified/enhanced with the PRE theory that has been introduced as a solution.
Behavior: The manner of the learner is regarded in the role of a decision parameter solution.Precisely, a solution, that is learner, is a vector of decision parameters that is manners.
Population: within the current algorithm, a set of pupils creates a population.Every single row signifies a solution (learner) and every component Z d i signifies a decision variable (manner).Fitness (cost value) assessment: The fitness of manners for every learner Z i is computed with an operator-defined cost value f Zi ← F (Z i ) on the decision variables as manners

Time (Interval):
The iterations' amount is among 2 motivations, assessments, or strengthening stages, which requires the decision variables that are manners of a pupil throughout the search procedure, has been regarded as a time intermission.A counting instrument is utilized in the offered algorithm regarding a manner with a greater score possesses a big chance of being strengthened within the subsequent iterations.
Response: Attaining a lot of responses is a foremost aim.In the current search, fruitful enhancement is defined as a response in the cost value amount.Consequently, if F(X ʹ ) is less than F(X), here F(X') and F(X) are the present and the former cost value amounts of a definite solution after the motivation stage, in turn.
Schedule: The schedule's perception is defined in what way and once manners must be strengthened, being demonstrated for an information organization throughout diverse terms.Every single learner has a particular schedule.Subsequently, every single scaler signifies the score of a particular pupil's manner using a greater score and possesses a big chance of being designated within the following iteration.Moreover, the parameter-term scheduling outline is demonstrated as a dynamic instrument to comportment a random investigation with Eqs. ( 19)- (21).In what way and once manners must be strengthened is a variable and is renewed throughout the search procedure.Consequently, a subsection of manners γ ⊆ [1, 2, 3, …, N], using the maximum primacies within i th schedule is designated at every iteration for lrnr i .Along these lines, initially, in the schedule, priorities are arranged in descendent instruction, then, the 1st δ number of matters is nominated as the candidate manner with the next equation.τ← FEs MaxFEs ( 21) Here the time is defined by τ, the quantity of assessments of function has been determined via FEs and the max quantity of assessments of function has been considered to be MaxFEs.Furthermore, SR is the collection ratio, γ is a subsection of manners nominated on the basis of scheduling, δ is the nominated subsection's size, and N is the whole number of manners that are decision variables.Schdl * signifies the schedule using organized priorities and Schdl * ,δ is the δth element within the Schdl * .Stimulation: Each effort for encouraging a pupil's manner to stimulate a response has been demonstrated via utilizing processes altering the offered solution's decision parameters.It is worth noting that any process could be utilized to encourage (alter) the manner of learners (decision variables).
The next processes are applied to produce novel solutions in the PRO algorithm, which is calculated by the next equations.
F. Wang et al.
here, the incentive factor is defined by SF i and α is the normalized score or priority's mean of the nominated decision parameters for ith pupil according to its scheduler.
Reinforcement: For conceptualizing this issue, the next mechanism is utilized to renew the scheduling.Then, positive reinforcement has been employed for surging the score in a particular manner.The cost value of the learner, the following equation is utilized as a response after the enhancement in the incentive stage.
Here, the reinforcement ratio is defined by RR, Schdl γ i signifies primacies of the nominated decision parameters as manners for the ith solution as a pupil.Conversely, negative reinforcement is utilized once a response does not exist.In these circumstances, the cost value of a learner drops afterward the incentive stage, causing a reduction in the particular manner's score.The decision variables (manners) using greater scores would be nominated for incentive and reinforcement in the following iteration.
Rescheduling: This perception denotes the procedure of employing a novel schedule of a pupil throughout teaching, once the pupil constantly obtains negative reinforcement of total manners.Within the current circumstance, the current algorithm applies standard deviation (Std) of the schedule as a measurement tool for defining whenever it must reorganize the learner.Next equations define this mechanism.
Here, Std( Schdl i ) is the schedule's standard deviation of ith learner, L and U are the lower and upper boundary, in turn.Furthermore, U(0, 1) and U(L, U) denote stochastic amounts using a uniform distribution that is from (0, 1) to (L, U).

The improved PRE
One potential motive for altering the PRE algorithm is to enhance its effectiveness and resilience when dealing with intricate and high-dimensional optimization problems.The initial PRE algorithm may encounter challenges such as premature convergence, stagnation, or sluggish convergence when confronted with certain problems, particularly those characterized by deceptive, multimodal, or noisy landscapes.Consequently, certain adjustments may be necessary to augment the algorithm's capacity for exploration and exploitation, as well as to accommodate diverse problem characteristics.
The current study introduces a novel modification aimed at resolving these concerns, leading to improved precision and comprehensive results.This enhancement entails the implementation of a fractional modification.In order to have a deeper comprehension of this technique, a concise elucidation of Fractional calculus (FC) is presented.
FC, or function composition, has recently gained recognition as a highly effective approach for improving the efficacy of algorithms.The FC technique provides a systematic methodology that incorporates inherited traits, taking into account both the procedure and memory aspects.Fractional calculus has been considered to be a valuable technique to improve the efficacy of algorithms via considering perspective of memory while renewing solutions.The GL (Grunwald-Letnikov) technique has been found to be a frequently employed model of fractional calculus (FC).Below is the step-by-step process of this mechanism.
where, ( The gamma function, represented as Γ(t), is employed in the computation of the GL fractional derivative of order σ.The derivative, denoted as D σ (Z γ i ) is defined by the subsequent expression: The memory length, also known as the memory window, is controlled by the value of N. The sampling time is set by T, and σ is the operator for the derivative order.Assuming σ equals 1, the equation that came before can be restated as: where, S 1 [Z γ i ] determines the alteration between the two following movements.FC memory is employed to enhance the positioning of the algorithm using the following methods: The basic equation is derived in the following manner: Based on the provided equation, the algorithm and assuming g = 4, it is modified in the following manner:

Algorithm validation
The efficacy of the suggested IPRE algorithm was assessed utilizing the CEC-BC-2019 test benchmark.There are ten benchmark functions that tackle various problem contexts.It was specifically designed for "The 100-Digit Challenge", an annual optimization competition that aims to find the best efficient solution for each function with an accuracy of 100 decimal places.
Among the functions, the CEC01 possesses two dimensions, while the CEC02 encompasses three dimensions, and the CEC03 encompasses five dimensions.The remaining functions within the suite are characterized by having 100 dimensions.The functions can be categorized into four primary classifications, namely unimodal, multimodal, composition, and hybrid.Unimodal functions, specifically CEC01 and CEC02, are characterized by having only one optimal solution and do not provide any additional feasible alternatives.
In contrast, multimodal functions such as CEC03 and CEC04 offer multiple high-quality solutions in addition to the optimal solution.Hybrid functions, exemplified by CEC05 and CEC06, are composed of a combination of two or more unimodal or multimodal functions.Composition functions, ranging from CEC07 to CEC10, are constructed by aggregating numerous elementary functions through weighted summation.
An assessment was carried out to assess the efficiency of the Modified Gooseneck Barnacle Optimizer in handling operations.This study encompassed the utilization of several sophisticated algorithms, namely Owl Search Algorithm (OSA) [20], Squirrel search algorithm (SSA) [21], BBO (Biogeography-Based Optimizer) [22], LS (Locust Swarm Optimizer) [23], and Wildebeest herd optimization (WHO) [24].Table 1 illustrates the utilized parameter values of the algorithms.
It should be note that the parameter setting for the algorithms are kept close to their default values in their main paper to provide a fair comparison.The evaluation was carried out on 20 occasions to ensure a dependable result.The algorithms produced a cumulative tally of fifty solutions and necessitated one thousand iterations.The examination employed two metrics, namely the mean fitness value (MF) and the standard deviation of the fitness value (SD).Table 2 illustrates a comparative assessment of the recommended IPRE algorithm in contrast to alternative algorithms.Fig. 5.The graphical abstract of the method.
F. Wang et al.
The findings illustrate the IPRE algorithm surpasses the other ones in most of the test functions, particularly in the composition and hybrid functions.The IPRE algorithm achieves the finest outcomes in 7 out of 10 test functions and the second finest findings in 2 out of 10 test functions.Additionally, the IPRE algorithm exhibits the smallest standard deviation and mean values in the majority of the test functions, indicating its high precision and dependability.
The IPRE algorithm is derived from the Improved Partial Reinforcement Effect (IPRE) algorithm, which is an adapted version of the Partial Reinforcement Effect (PRE) algorithm.The IPRE algorithm incorporates several modifications to the original PRE algorithm, including the utilization of a dynamic population size, a self-adaptive reinforcement behavior, a hybrid crossover operator, and a mutation operator that follows a Cauchy distribution.These alterations contribute to enhancing the algorithm's exploration and exploitation capabilities, as well as its adaptability to diverse problem characteristics.
On the other hand, the alternative algorithms are based on distinct nature-inspired algorithms, like the Owl Search Algorithm (OSA), Squirrel Search Algorithm (SSA), BBO (Biogeography-Based Optimizer), LS (Locust Swarm) Optimizer, and Wildebeest Herd Optimization (WHO).These algorithms imitate the behaviors of different animals, such as owls, squirrels, biogeographical species, locusts, and wildebeests, in order to search for optimal solutions.However, these algorithms may encounter certain limitations, such as premature convergence, stagnation, or slow convergence, particularly when dealing with problems characterized by deceptive, multimodal, or noisy landscapes.
Therefore, in accordance with the outcomes presented in Tables 2 and it is clear the IPRE algorithm is a more effective approach.

System configuration
As stated in the previous section, we apply our proposed model (RNN/IPRE), which combines the Rigdelet transform and the IPRE for music genre feature classification.Simulations were conducted on a 16 GB RAM Intel Core i7 Laptop using the MATLAB R2019b software.The datasets were split into test sets and training that 80 % of the information utilized for training and the other 20 % utilized for examining.The graphical abstract of the method is shown in Fig. (5).
As can be observed from Fig. (5), the entire procedure comprises of three main stages: preprocessing, feature extraction, and evaluation.Initially, in the preprocessing phase, the raw audio signals undergo various operations to prepare them for further analysis.Signal pre-emphasis is applied to highlight the high-frequency components, preserving important details.Frame blocking is then used to divide the continuous signal into smaller fragments, allowing for the analysis of short snippets of sound.To address abrupt signal terminations, Hamming windowing is employed to apply a tapered window to each frame.
Following preprocessing, the feature extraction stage involves deriving relevant characteristics from the audio content.A spectrogram is generated, providing a visual representation of changes in the signal's spectral density over time.Frequencies are displayed on the y-axis, while the x-axis represents chronological progression.The color intensities in the spectrogram reflect the magnitude of the dominant frequencies.
Lastly, the evaluation phase utilizes two key elements for genre classification based on spectrograms.Ridgelet Neural Networks are employed to scrutinize the spectrograms using Ridgelet functions, which examine the input data and are combined with Ridgelet parameters such as scale (a), translation (b), and the Ridgelet function (ψ) itself.The Improved Partial Reinforcement Effect (IPRE) Optimizer complements the process by refining variable adjustments to facilitate optimal weight and bias discoveries, thereby enhancing accuracy and performance.
In summary, these interconnected stages leverage the advantages of Ridgelet Neural Networks and enhancement strategies to deliver precise and improved music genre classifications.

Configuration and setting of the parameters
In this study, the method proposed for music genre classification entails various adjustable parameters.The following will delve into the specifics of the configuration and settings for each of these parameters.

-Frame Length and Shift
In the initial stage of preprocessing, the audio signal is divided into frames of a specified length with a certain amount of overlap.The frame length impacts the resolution of the frequency-domain representation, while the overlap influences the smoothness of the transition between successive frames.For this study, the frame length was designated as 1024 samples with a shift of 512 samples, resulting in a 50 % overlap between frames.

-Window Function
To address the discontinuities that arise from abrupt transitions at the edges of each frame, a window function is applied to the frames.While there are various window functions to choose from, the Hamming window was selected for this study due to its favorable balance between spectral leakage and signal distortion.
-The Mel filter bank To transform the frequency axis of the spectrum into a logarithmic scale, mimicking the way the human auditory system perceives pitch.The quantity of filters in the Mel filter bank impacts the level of detail in the transformation and the resulting output.In this study, 40 Mel filters were employed.
-The MFCC coefficients It has been calculated based on the Mel filter bank representation.The number of coefficients dictates the dimensionality of the feature vector.In this research, 13 MFCC coefficients were utilized, a common choice in music genre classification tasks.
-Learning rate and momentum parameters During the training phase, the learning rate and momentum parameters regulate the weight updates in the neural network.The learning rate controls the step size in each iteration, while momentum reduces oscillatory behavior and encourages smoother updates.In this study, the learning rate was initialized at 0.01, and the momentum was set to 0.9.
-The total number of weight updates during training It is determined by the number of iterations, while the mini-batch size specifies the number of examples processed simultaneously.Increasing the number of iterations or decreasing the batch size can enhance the model's alignment with the data, albeit at the cost of longer training times.In this work, 500 iterations were set, and the mini-batch size was 64.
-Dropout It is a regularization technique that helps prevent overfitting by randomly deactivating nodes during training.The dropout rate indicates the likelihood of each node being deactivated.In this study, the dropout rate was established at 0.5.

Dataset description
This study uses GTZAN Dataset for evaluation and analysis of the proposed model.Dataset music genre classification serves as a widely recognized benchmark for tasks related to categorizing genres of music.It comprises a total of 1000 audio tracks, each lasting for 30 s.These tracks are evenly distributed across ten genres, including contains 10 genres: classical, blues, disco, country, jazz, hiphop, pop, metal, rock, and reggae.All audio tracks are encoded as 22050Hz Mono 16-bit audio files in format of wav [25].
The current dataset was initially developed by George Tzanetakis, Georg Essl, and Perry Cook in 2001, and has since been extensively utilized in various research papers and projects.It is readily accessible on Kaggle.
Furthermore, the extracted features encompass a range of spectral, temporal, and cepstral characteristics.These include the spectral centroid, rate of zero-crossing, spectral flux, spectral rolloff, spectral flatness, spectral entropy, spectral contrast, spectral bandwidth, spectral spread, spectral skewness, spectral kurtosis, Root Mean Square energy, low energy, energy entropy, MFCCs, chroma vector, and chroma deviation.These features are commonly employed in music genre classification and can serve as inputs for machine learning models.To access the dataset, it can be downloaded from Kaggle as a zip file from the subsequent link: website: https://www.kaggle.com/datasets/andradaolteanu/gtzan-dataset-music-genre-classification?resource=download.
Table 3 indicates the basic information of the GTZAN dataset.The dataset is well-suited for music genre classification tasks due to its inclusion of a wide array of music genres and diverse data representations.

Experimental results
The proposed RNN/IPREO model in this study demonstrates an accuracy rate of 93 % when evaluated on the training data set, accompanied by a training loss of 0.10.However, when the number of epochs is increased to 15, the model converges rapidly, leading to a decrease in accuracy.As the GTZAN data set is balanced, accuracy is utilized as the scoring metric.It is worth noting that by adjusting the parameters in the future, it is possible to achieve higher accuracy.The suggested RNN/IPREO model's training accuracy during the iterations has been depicted in Fig. (6).Fig. (7) illustrates the loss curve of the RNN/IPREO model.Notably, once the quantity of iterations has been considered to be 16, the model approaches one, resulting in a decline in accuracy.

Comparison analysis
For assessing the efficacy of the suggested RNN/IPREO model for music genre classification, some experiments have been conducted on the beforementioned GTZAN dataset, which serves as a benchmark in this particular domain.
The RNN architecture is designed with an input shape, 128 units in the hidden layer, and a ReLU activation function.On the other hand, the CNN structure features an input shape, 32 filters in the first convolutional layer, and a pooling size of (2,2).Moving on to the DNN model, it shows an input shape, 128 units in the first hidden layer, followed by 64 and 32 units in subsequent layers.Moreover, the VGG-16 and ResNet-50 models come with pretrained weights from ImageNet and some frozen early layers.Lastly, the RNN-LSTM model is characterized by an input shape, 128 LSTM units in the hidden layer, and a tanh activation function.The RNN/IPREO model, on the other hand, has an input shape, 128 units in the hidden layer, and a ReLU activation function.All these models are configured to utilize binary cross-entropy as the loss function, Adam as the optimizer, and are trained for 100 epochs with a batch size of 32.The classification accuracies of our proposed RNN/IPREO model for music genre classification are presented in the following.Table 4 presents the findings that highlight the exceptional accuracy achieved by the proposed RNN/IPREO model in comparison to traditional methods for music genre classification.
As can be observed, by utilizing a combination of spectral centroid, mel-spectrum, and MFCCs as features, the RNN/IPREO model outperforms the K-Means and SVM methods by a significant margin.This indicates that the RNN/IPREO model has the capability to learn more intricate and distinctive features from the sound spectrum, while efficiently and effectively optimizing the parameters using the IPREO algorithm.
As mentioned before, the suggested technique was also contrasted with some related deep models.Table 5 illustrates the findings that highlight the exceptional accuracy achieved by the proposed RNN/IPREO model in comparison to traditional methods for music genre classification.
This achievement is attributed to the hybrid architecture of CNN and Bi-RNN blocks employed in the RNN/IPREO model.The RNN/ IPREO model outperforms both the Neural Network and RNN models by a significant margin, indicating its ability to leverage the spatial attributes extracted via the Convolutional Neural Network block and the temporal attributes extracted via the Bi-RNN block.Furthermore, the RNN/IPREO model surpasses the CNN and DNN models, which solely rely on convolutional or dense layers, respectively.This suggests that the RNN/IPREO model can effectively capture complex and sequential features from the sound spectrum while optimizing parameters more efficiently using the IPREO algorithm.
Notably, the RNN/IPREO model demonstrates comparable performance to well-known and powerful deep models such as VGG-16, ResNet-50, and RNN-LSTM models, which are widely recognized for image and audio classification tasks.In fact, the RNN/IPREO model slightly outperforms the VGG-16 and RNN-LSTM models, and exhibits a similar level of performance to the ResNet-50 model.These results underscore the superiority of the hybrid methodology of Convolutional Neural Network and Bi-RNN, as well as the effectiveness of the IPREO optimization.

Model selection and validation
In order to ensure a thorough and unbiased assessment of the proposed RNN/IPREO model, we implemented a model selection and validation approach utilizing a distinct validation set.IPREO model, the validation set was used to identify the best performing model based on the lowest validation error, and the test set was employed to present the final test outcomes.The performance metrics for RNN/IPREO model on test data are displayed in Table 6.
As can be observed from Tables 6 and It achieves an accuracy of 0.96, correctly classifying 96 % of test samples, which showcases the model's strong predictive capabilities.In terms of precision, the model boasts a rate of 0.96, indicating that 96 % of its positive predictions are accurate, minimizing false positives and ensuring reliable outputs.
With regards to recall, the RNN/IPREO model identifies 96 % of genuine positive cases in the dataset, indicating low false negatives and efficient identification abilities.Sustaining this exceptional performance, the model attains an F1-score of 0.96, reflecting a harmonious balance between precision and recall.Additionally, the model demonstrates a specificity of 0.96, effectively isolating negative instances and exhibiting minimal false positive rates alongside its remarkable recall.With an MCC of 0.92, there is a significant alignment between expected and predicted labels, approaching flawless categorization.Furthermore, the model achieves a 0.99 AUROC, skillfully distinguishing true positive rates from false ones across various threshold settings.Lastly, the RNN/IPREO model rapidly processes predictions at a rate of 0.001 s per sample, making it suitable for real-time applications that require timecritical tasks.Considering this extensive range of metrics, the RNN/IPREO model emerges as a formidable contender for music genre classification tasks, consistently delivering exemplary results.

Discussions
Response: In the simulations discussed in the paper, the authors used 13 Mel-Frequency Cepstral Coefficients (MFCCs) as features for music genre classification.MFCCs are a popular feature extraction technique in speech and audio processing, used to represent the spectral envelope of a signal.The authors chose MFCCs for music genre classification because they effectively capture the salient spectral properties of music signals and provide a compact representation of the audio signal that can be fed into machine learning models for classification.

Table 5
The effectiveness of the proposed RNN/IPREO model with classic methods.
The MFCC feature extraction process involves six main steps.Firstly, the audio signal is filtered with a high-pass filter, known as pre-emphasis, to compensate for the attenuation of high-frequency components in recorded signals.Secondly, the audio signal is divided into overlapping frames, typically with a duration of 20-30 ms and an overlap of 50 %.Thirdly, each frame is transformed into the frequency domain using the Fast Fourier Transform (FFT).Fourthly, a Mel filter bank is applied to the frequency spectrum, converting the linearly spaced frequency bands to Mel-spaced frequency bands, mimicking the non-linear frequency sensitivity of the human ear.
Fifthly, the log energy of each Mel-filtered frequency band is computed.Lastly, the log energies are transformed into the time domain using the Discrete Cosine Transform (DCT), resulting in the MFCC coefficients.Specifically, in the simulations reported in the paper, the authors used 13 MFCC coefficients, which is a common practice in music genre classification.The first coefficient represents the DC offset or the average energy of the signal, while the remaining coefficients capture the spectral shape of the signal.The number of MFCC coefficients used depends on the balance between computational complexity and classification accuracy.Too few coefficients may lead to underfitting, whereas using too many coefficients can result in overfitting, increasing computational complexity without significantly improving classification accuracy.
The proposed approach offers several advantages compared to existing methods.Firstly, it eliminates the need for manual feature selection, which can be time-consuming and may not accurately capture the essential characteristics of music genres.Secondly, it utilizes the sound spectrum as a united indication, allowing for the preservation of both temporal and frequency information in the audio signal.Thirdly, it employs an enhanced RNN that can effectively learn complex and high-level features from the sound spectrum, along with an improved IPREO that optimizes the RNN efficiently.Lastly, our approach demonstrates robustness to noise and variations in the audio signal, enabling it to handle different music genres with distinct characteristics.
However, it is important to acknowledge the limitations and potential future directions of our approach.One limitation is the requirement for a substantial amount of data and computational resources to train the RNN and IPREO.Additionally, our approach may not fully capture the semantic and cultural aspects of music genres, as these aspects often rely on human perception and preference.To address this, one possible future direction is to incorporate prior knowledge or human feedback into our approach, enhancing the interpretability and personalization of music genre classification.Furthermore, extending our approach to other musicrelated tasks, such as music emotion recognition, music similarity measurement, and music generation, presents another promising avenue for future research.

Conclusions
The classification of music genres has been considered to be a crucial task regarding music data retrieval and music recommendation systems.However, many existing methods rely on the manual choice of attributes, which takes a long time and may not accurately capture the fundamental characteristics of different music genres.Within the present study, a novel innovative approach was suggested to music genre classification.The proposed approach involved converting an audio signal into a sound spectrum, which serves as a united indication of the music.Texture features were then extracted from these time-frequency images using an improved Rigdelet Neural Network (RNN).The RNN is a deep learning model that incorporates various techniques, such as gating mechanism, 1D convolution, attention mechanism, and residual connection.These techniques enabled the RNN to learn hierarchical and discriminative features from the sound spectrum.Furthermore, the RNN was optimized using an enhanced version of the partial reinforcement effect optimizer (IPREO).IPREO was a meta-heuristic algorithm that effectively avoided local optima and enhances the generalization ability of the RNN.For assessing the efficacy of our approach, outcomes were carried out on the widely used GTZAN dataset, which is a benchmark for music genre classification.Additionally, the model was compared with traditional manual models, including traditional methods (K-Means, SVM), and deep RCNN (Recurrent Neural Network), CNN (Convolutional Neural Network), VGG-16, RNN-LSTM, DNN (Deep Neural Network), and ResNet-50, Neural Network) models through comparison and ablation experiments.The statistical outcomes demonstrated the suggested model provides superior efficiency compared to the other comparative models in music genre classification.In the future work, a thorough investigation will be investigated in order to examine the influence of various subgenres on the process of feature extraction and classification with more detailed and adaptable models.

F
.Wang et al.

Fig. ( 2 )
shows the feature extraction by the spectral centroids for this signal waveform.Fig. (3) shows the mel-spectrogram of the tested music waveform for the feature extraction.
1, …, L − 1. N specifies the number of retained coefficients.w[n] specifies the window function of length L, x[n] describes the signal of the length N, H signifies the frame shift hop size, X t [k] specifies the DFT of the windowed frame, f describes the frequency in Hertz, f l [m], f c [m], and f u [m] depict the smaller, center, and higher frequencies of the filter m, respectively, and k is the frequency bin index, S t [m] specifies the filter bank, and C t [n] defines the DCT of the log filter bank outputs.

Fig. 3 .
Fig. 3.The mel-spectrogram of the tested music waveform for the feature extraction.

F
. Wang et al.

5 F
. Wang et al.The efficacy of the algorithms is assessed utilizing two measurement tools: the mean function value (MF) and the standard deviation (SD) calculated over 25 iterations.A lower MF and SD indicate better algorithm performance.The most favorable outcomes for each test function are highlighted in bold.

F
.Wang et al.

Fig. 7 .
Fig. 7. Training loss curve of the proposed RNN/IPREO model during the iterations.
The current case happens whenever a promising occasion or outcome has been offered after a definite quantity of responses or manners.Positive reinforcements, for instance, straight prizes or admiration, are comprised to reinforce the manner.Precisely, whenever you do a brilliant presentation at work, you obtain a handout from your chief.-Negativereinforcement: the current kind of reinforcement focuses on eliminating disapproving moments after a definite manner happens.

Table 1
Utilized variable values of the algorithms.

Table 2
Comparative assessment of the recommended IPRE algorithm in contrast to alternative algorithms.

Table 3
basic information of the GTZAN dataset.

Table 4
The effectiveness of the proposed RNN/IPREO model with classic methods.

Table 6
The performance metrics for RNN/IPREO model on test data.